Statistical clustering techniques in historical English linguistics
Abstract
[Table 2 (truncated): frequency values of past participles per period]

The information in Table 2 can be used to measure similarity between adjacent periods: Each period is represented not by a single frequency value but in fact by a vector of frequency values for the past participles; relative similarity between adjacent periods can be computed with a statistic such as Pearson's correlation r. The first step of VNC is to find the two periods that correlate best, i.e., whose correlation coefficient is highest. In this case, periods two (1700-49) and three (1750-99) exhibit the highest correlation. The correlation between their collocate sets is shown in the left panel of Figure 4, in which it can be seen that especially the more frequent past participles correlate rather well. These two periods are then merged, i.e., a new period is created that holds the respective mean values for each past participle (1 for abandoned, 2 for abated, 1 for abolished, etc.). The dendrogram in the right panel of Figure 4 results from successive mergers of this kind.

The dendrogram allows several ways to partition the ARCHER data, but more importantly, it also makes clear that certain default periodizations would be problematic. For instance, cutting up the data into century-length chunks would obscure the relatively greater differences that are found in the 1800s and 1900s. By contrast, if the data were to be used for a binary contrast of before and after, it would make sense to split the data by the year 1900. Especially for a distinctive collexeme analysis, that would be a reasonable option to bring out collocational differences in the development of the English passive.

6. Detection of outliers with Variability-based Neighbor Clustering

The final application of VNC to be mentioned here is the detection of outliers in fine-grained historical data. This is particularly useful in the analysis of year-by-year data that can exhibit substantial fluctuation.
For example, Gries & Hilpert (2010) study the decline of the third person singular -(e)th in English. Between the early 15th century and the late 17th century, forms such as
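The merging procedure described in the abstract (correlate adjacent periods, merge the best-correlated pair into a new period holding the mean values, repeat) can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `pearson` and `vnc_merges`, the period labels P1 to P4, and the toy frequency vectors are all invented for the example, and using an unweighted mean when an already-merged period is merged again is an assumption.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation r between two equal-length frequency vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # assumption: treat a constant vector as uncorrelated
    return cov / (sd_x * sd_y)


def vnc_merges(labels, vectors):
    """Repeatedly merge the two ADJACENT periods with the highest correlation.

    Returns the merge history as (left_labels, right_labels, r) tuples,
    which is the information a dendrogram would be drawn from.
    """
    clusters = [([label], list(vec)) for label, vec in zip(labels, vectors)]
    history = []
    while len(clusters) > 1:
        # Only neighbours in time are candidates, so the order is preserved.
        scores = [
            pearson(clusters[i][1], clusters[i + 1][1])
            for i in range(len(clusters) - 1)
        ]
        best = max(range(len(scores)), key=scores.__getitem__)
        left_labels, left_vec = clusters[best]
        right_labels, right_vec = clusters[best + 1]
        # The merged period holds the mean value for each item
        # (unweighted mean of the two vectors: an assumption of this sketch).
        merged_vec = [(a + b) / 2 for a, b in zip(left_vec, right_vec)]
        history.append((left_labels, right_labels, scores[best]))
        clusters[best:best + 2] = [(left_labels + right_labels, merged_vec)]
    return history


# Hypothetical toy data: four periods, four past participles each.
labels = ["P1", "P2", "P3", "P4"]
vectors = [
    [5, 1, 0, 2],
    [1, 4, 2, 0],
    [1, 5, 2, 0],
    [0, 1, 6, 3],
]
history = vnc_merges(labels, vectors)
for left, right, r in history:
    print(left, "+", right, "r =", round(r, 3))
```

Because only neighbouring periods are ever compared, the chronological order of the data is preserved at every step, which is what distinguishes this procedure from ordinary hierarchical clustering.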
Related resources
VARD 2: A tool for dealing with spelling variation in historical corpora
Spelling variation causes considerable problems for corpus linguistic techniques such as frequency analysis, concordancing and automatic tagging, with a significant impact being made on recall and the accuracy of results [1]. This paper will focus on Early Modern English, the most recent period of the English language to include a large amount of inconsistent spelling. Although many corpora of ...
A Mathematical Model of Historical Semantics and the Grouping of Word Meanings into Concepts
A statistical analysis of polysemy in sixteen English and French dictionaries has revealed that, in each dictionary, the number of senses per word has a near-exponential distribution. A probabilistic model of historical semantics is presented which explains this distribution. This mathematical model also provides a means of estimating the average number of distinct concepts per word, which was ...
Tagging Historical Corpora - the problem of spelling variation
Spelling issues tend to create relatively minor (though still complex) problems for corpus linguistics, information retrieval and natural language processing tasks that use ‘standard’ or modern varieties of English. For example, in corpus annotation, we have to decide how to deal with tokenisation issues such as whether (i) periods represent sentence boundaries or acronyms and (ii) apostrophes ...
Finding Groups in Chronologically-Ordered Corpus Data: Variance-Based Neighbor Clustering
Much corpus-linguistic research is concerned with the development of particular parameters over time. For example, in L1/L2 acquisition/learning, the syntactic development of a child/learner is approximated on the basis of how mean lengths of utterances (MLU), t-unit-based measures, or IPSyn values change over time (cf. Shirai and Andersen 1995 or Ortega 2003). Similarly, in historical linguist...
Multilingual Metaphor Processing: Experiments with Semi-Supervised and Unsupervised Learning
Highly frequent in language and communication, metaphor represents a significant challenge for Natural Language Processing (NLP) applications. Computational work on metaphor has traditionally evolved around the use of hand-coded knowledge, making the systems hard to scale. Recent years have witnessed a rise in statistical approaches to metaphor processing. However, these approaches often requir...
Publication date: 2010